feat: add PDF OCR and metadata management capabilities #3

nedanwr · 2026-01-16T07:15:26Z

Description

This PR introduces comprehensive PDF OCR (Optical Character Recognition) and metadata management capabilities to forgekit. Users can now:

- OCR: Add searchable text layers to scanned PDFs using ocrmypdf, with support for multiple languages, deskewing, and text detection
- Metadata Management: Read and write PDF metadata (title, author, subject, keywords, etc.) using exiftool

The implementation includes new CLI commands (forgekit pdf ocr and forgekit pdf metadata), core library enhancements for job specifications, and tool adapters for Ocrmypdf and ExifTool.

Type of Change

Scope

cli - CLI commands and interface
core - Core library functionality
tools - Tool integrations (qpdf, ghostscript, etc.)
pdf - PDF-specific operations
packaging - Package configurations
Other:

Related Issues

Checklist

I have read the commit convention
My commits follow the conventional format: type(scope): description
I have added/updated tests as appropriate
All tests pass locally (cargo test)
Code passes lint checks (cargo clippy and cargo fmt)
I have updated documentation if needed

Breaking Changes

This PR contains breaking changes

Key Changes

CLI Commands

Added forgekit pdf ocr command with support for:
- Language selection (e.g., eng, deu, fra)
- Deskewing for tilted scans
- Force OCR mode for re-processing
- Skip-text mode for mixed PDFs
Added forgekit pdf metadata command with support for:
- Reading all metadata (forgekit pdf metadata doc.pdf)
- Getting specific fields (--get title)
- Setting multiple fields (--set title="My Doc" --set author="John")
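As a rough illustration of how the flags above might map onto the underlying tools, here is a minimal sketch. `OcrOptions`, `ocrmypdf_args`, and `exiftool_set_arg` are hypothetical names chosen for this example and are not taken from the forgekit source; only the `ocrmypdf` flags (`-l`, `--deskew`, `--force-ocr`, `--skip-text`) and the exiftool `-TAG=VALUE` write syntax are the real tools' interfaces.

```rust
use std::path::Path;

/// Hypothetical option set mirroring the `forgekit pdf ocr` flags listed above.
#[derive(Debug, Default, Clone)]
pub struct OcrOptions {
    pub languages: Vec<String>, // e.g. ["eng", "deu"]
    pub deskew: bool,
    pub force_ocr: bool,
    pub skip_text: bool,
}

/// Build the argument list for an `ocrmypdf` invocation (assumed mapping).
pub fn ocrmypdf_args(
    opts: &OcrOptions,
    input: &Path,
    output: &Path,
) -> Result<Vec<String>, String> {
    // ocrmypdf does not accept these two modes together, so fail early.
    if opts.force_ocr && opts.skip_text {
        return Err("--force-ocr and --skip-text cannot be combined".to_string());
    }
    let mut args = Vec::new();
    if !opts.languages.is_empty() {
        // Multiple languages are joined with '+', e.g. `-l eng+deu`.
        args.push("-l".to_string());
        args.push(opts.languages.join("+"));
    }
    if opts.deskew {
        args.push("--deskew".to_string());
    }
    if opts.force_ocr {
        args.push("--force-ocr".to_string());
    }
    if opts.skip_text {
        args.push("--skip-text".to_string());
    }
    // Positional arguments: input file first, then output file.
    args.push(input.to_string_lossy().into_owned());
    args.push(output.to_string_lossy().into_owned());
    Ok(args)
}

/// Turn one `--set FIELD=VALUE` argument into an exiftool write argument.
pub fn exiftool_set_arg(pair: &str) -> Result<String, String> {
    match pair.split_once('=') {
        Some((field, value)) if !field.trim().is_empty() => {
            Ok(format!("-{}={}", field.trim(), value))
        }
        _ => Err(format!("expected FIELD=VALUE, got `{}`", pair)),
    }
}
```

The resulting vectors could be handed to `std::process::Command::new("ocrmypdf").args(&args)` (or the exiftool equivalent). Building arguments separately from spawning the process keeps the mapping easy to unit-test without either tool installed.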

Core Library

Added Ocrmypdf tool adapter for OCR operations
Added ExifTool adapter for metadata management
Enhanced JobSpec with PdfOcr and PdfMetadata variants
Added MetadataAction enum for metadata operations (Get, GetAll, Set)
Integrated OCR and metadata operations into job executor with progress reporting
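For readers unfamiliar with the job model, the shape of these types might look roughly like the sketch below. The variant and enum names (`PdfOcr`, `PdfMetadata`, `MetadataAction` with `Get`, `GetAll`, `Set`, and `PdfMetadataField`) come from this PR; every field, the set of supported metadata fields, and the `describe` helper are assumptions made for illustration and may differ from the actual forgekit definitions.

```rust
use std::path::PathBuf;
use std::str::FromStr;

/// Metadata fields; the exact set supported by forgekit is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PdfMetadataField {
    Title,
    Author,
    Subject,
    Keywords,
}

impl FromStr for PdfMetadataField {
    type Err = String;

    /// Case-insensitive parsing of a field name such as `title`.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "title" => Ok(Self::Title),
            "author" => Ok(Self::Author),
            "subject" => Ok(Self::Subject),
            "keywords" => Ok(Self::Keywords),
            other => Err(format!("unknown metadata field: {}", other)),
        }
    }
}

/// The three metadata operations named in the PR description.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MetadataAction {
    GetAll,
    Get(PdfMetadataField),
    Set(Vec<(PdfMetadataField, String)>),
}

/// Only the two new variants are shown; the field layout is hypothetical.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum JobSpec {
    PdfOcr {
        input: PathBuf,
        output: PathBuf,
        languages: Vec<String>,
        deskew: bool,
    },
    PdfMetadata {
        input: PathBuf,
        action: MetadataAction,
    },
}

/// Hypothetical helper producing a one-line summary for progress reporting.
pub fn describe(spec: &JobSpec) -> String {
    match spec {
        JobSpec::PdfOcr { input, languages, .. } => {
            format!("OCR {} [{}]", input.display(), languages.join("+"))
        }
        JobSpec::PdfMetadata { input, action } => match action {
            MetadataAction::GetAll => {
                format!("read all metadata from {}", input.display())
            }
            MetadataAction::Get(field) => {
                format!("read {:?} from {}", field, input.display())
            }
            MetadataAction::Set(pairs) => {
                format!("write {} field(s) to {}", pairs.len(), input.display())
            }
        },
    }
}
```

Modelling `Get`, `GetAll`, and `Set` as enum variants lets the executor match exhaustively, so a newly added action cannot be silently ignored.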

Version

Bumped version to 0.0.5 for core packages and workspace

coderabbitai · 2026-01-16T07:15:39Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

nedanwr added 10 commits January 16, 2026 15:10

feat(cli): add OCR and metadata management commands for PDF processing

4f03f8c

feat(cli): add Exiftool and Ocrmypdf tools to check dependencies

bb999a5

feat(core): implement ExifTool adapter for PDF metadata management

e0d5513

feat(core): add Ocrmypdf tool adapter for OCR operations on PDFs

4d0c2d9

feat(core): add Exiftool and Ocrmypdf modules to tools for enhanced PDF processing

36c556f

feat(core): implement PDF OCR and metadata management functions with progress reporting

f6d4f56

feat(core): add MetadataAction to job specifications for enhanced metadata management

7613213

feat(core): enhance job specifications with PDF metadata management and OCR capabilities

6ae2ab8

chore: bump version to 0.0.5 for core packages and workspace

f163920

chore: update ROADMAP.md to reflect completion of v0.0.5 with PDF OCR and metadata features

9b1c0f7

nedanwr self-assigned this Jan 16, 2026

nedanwr added 3 commits January 16, 2026 15:21

refactor(core): change input and output types in PDF metadata functions from PathBuf to Path for improved type handling

05f79a4

refactor(core): implement FromStr trait for PdfMetadataField to streamline parsing

4e28f14

test(core): improve readability of PDF metadata field parsing tests

71e13ac

nedanwr merged commit 4fde02e into develop Jan 16, 2026
5 checks passed

nedanwr deleted the feat/pdf-ocr-metadata branch January 16, 2026 15:45
